The Corpus and the Lexicon: Standardising Deep Lexical Acquisition Evaluation

نویسندگان

  • Yi Zhang
  • Timothy Baldwin
  • Valia Kordoni
چکیده

This paper is concerned with the standardisation of evaluation metrics for lexical acquisition over precision grammars, which are attuned to actual parser performance. Specifically, we investigate the impact that lexicons at varying levels of lexical item precision and recall have on the performance of pre-existing broad-coverage precision grammars in parsing, i.e., on their coverage and accuracy. The grammars used for the experiments reported here are the LinGO English Resource Grammar (ERG; Flickinger (2000)) and JACY (Siegel and Bender, 2002), precision grammars of English and Japanese, respectively. Our results show convincingly that traditional Fscore-based evaluation of lexical acquisition does not correlate with actual parsing performance. What we argue for, therefore, is a recall-heavy interpretation of F-score in designing and optimising automated lexical acquisition algorithms.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

ULex: new data models and a mobile environment for corpus enrichment

The Ubiquitous Lexicon concept (ULex) has two sides. In the first kind of ubiquity, ULex combines prelexical corpus based lexicon extraction and formatting techniques from speech technology and corpus linguistics for both language documentation and basic speech technology (e.g. speech synthesis), and proposes new XML models for the basic datatypes concerned, in order to enable standardisastion ...

متن کامل

The Automatic Acquisition of Verb Subcategorisations and Their Impact on the Performance of an HPSG Parser

We describe the automatic acquisition of a lexicon of verb subcategorisations from a domain-specific corpus, and an evaluation of the impact this lexicon has on the performance of a “deep”, HPSG parser of English. We conducted two experiments to determine whether the empirically extracted verb stems would enhance the lexical coverage of the grammar and to see whether the automatically extracted...

متن کامل

Deep Lexical Acquisition of Type Properties in Low-resource Languages: A Case Study in Wambaya

We present a case study on applying common methods for the prediction of lexical properties to a low-resource language, namely Wambaya. Leveraging a small corpus leads to a typical high-precision, low-recall system; using the Web as a corpus has no utility for this language, but a machine learning approach seems to utilise the available resources most effectively. This motivates a semi-supervis...

متن کامل

Multiword Lexical Acquisition And Dictionary Formalization

In this paper, we present the current state of development of a large-scale lexicon built at LabEL1 for Portuguese. We will concentrate on multiword expressions (MWE), particularly on multiword nouns, (i) illustrating their most relevant morphological features, and (ii) pointing out the methods and techniques adopted to generate the inflected forms from lemmas. Moreover, we describe a corpus-ba...

متن کامل

Corpus-Based Induction of Lexical Representation and Meaning

The acquisition of linguistic knowledge, i.e., the identication, extraction, and encoding of linguistic information in a corpus, has been one of the main motivations for data-driven approaches to natural language. Methods have been developed for the acquisition of, for instance, parts of speech, noun compounds, collocations, support verbs, subcategorization frames, phrase structure rules, selec...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007